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O ' Abstract 

' In this work we study optimization problems subject to a failure constraint. This con- 

straint is expressed in terms of a condition that causes failure, representing a physical or 
technical breakdown. We formulate the problem in terms of a probability constraint, where 
the level of "confidence" is a modelling parameter and has the interpretation that the prob- 
ability of failure should not exceed that level. Application of the stochastic Arrow-Hurwicz 
O^I , algorithm poses two difhculties: one is structural and arises from the lack of convexity of the 

probability constraint, and the other is the estimation of the gradient of the probability con- 
I straint. We develop two gradient estimators with decreasing bias via a convolution method 

and a finite difference technique, respectively, and we provide a full analysis of convergence 
of the algorithms. Convergence results are used to tune the parameters of the numerical 
• algorithms in order to achieve best convergence rates, and numerical results are included via 

I an example of application in finance. 

Keywords. Probability constraints, stochastic programming, stochastic gradient algorithm, 
stochastic approximation 

QQ 1 Introduction 

(N 

O I 1.1 Constrained Optimization in a Stochastic Setting 

00 ■ 

■ Optimization Theory provides a convenient approach to formulate and solve problems involving 

. conflicting objectives, which is generally the challenge present in decision making situations. The 

main idea is to aggregate as many objectives as possible into a single objective function, which 
may be straightforward when those objectives are amenable to an expression into a common 
^ , unit, say, a currency unit as dollar or euro. In this objective aggregation, weights are allocated to 

^ I each term in order to reflect preferences or priorities. However, there might be other objectives 

that can hardly be expressed in a unit commensurable with the previous ones (examples to come 
hereafter). In such a case, it is better to introduce those other objectives through constraints, 
that is, each such objective should not exceed a prescribed level. The constraint levels are set a 
priori, as are the weights for the different terms in the cost function. 

Duality Theory provides the tools to evaluate the sensitivity of the optimal solution (cost) 
to those prescribed constraint levels. In mathematical terms, let u be the decision variable in 
a Hilbert space U, J : U ^ the cost function, and 9 : ^ — > M*^ the constraint function. We 
consider problems of the type: 

min J{u) s.t. B(n) < a , (1) 
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where U is an "admissible" or "feasible" closed convex subset of U and inequalities in the 
constraints involving G are understood componentwise. Introduce the multiplier A (in W^) and 
the Lagrangian 

L{u, A) = J{u) + (A, Q{u) - a) , (2) 

where (■, •) denotes the scalar product. Kuhn- Tucker optimality conditions characterize an 
optimal multiplier A^ which can be interpreted as the sensitivity of the optimal cost function 
J(u^) (where denotes the solution of problem ([T|)) with respect to a (up to a change of sign). 

When random factors affect the outcomes of a decision, a classical approach is to assume that 
the probability distribution of those factors is known and to appeal to stochastic optimization. 
Call ^ the corresponding random variable, then the objective function is usually expressed in 
terms of an expectation of some cost function of the form J{u) = E(j(n,.^)) . 

In the stochastic situation, modelling choices for aggregation of objectives, weights and con- 
straints are similar to the deterministic case. However a new question also arises regarding 
constraints, namely, constraints can be formulated in various ways: "almost surely", "in expec- 
tation", "in probability", etc. 

The first possibility ("almost sure" constraints) means that certain quantities 0(u,^) depend- 
ing on decision variables and affected by random factors should satisfy equality or inequality 
for "almost all" values of those random factors (according to their probability distributions). 
This is in particular the case of constraints which express "laws of nature" which are part of 
the mathematical model of the system under consideration. However, regarding objectives or 
"wishes" , such strict constraints are generally inappropriate from the economic or simply realis- 
tic point of view. Suppose for example that a pressure should not exceed a certain level beyond 
which death will almost surely happen. First of all, observe that it is hard to aggregate such an 
objective (actually, that to stay alive) with other more economic objectives which aim at saving 
money. Second, under the constraint that the pressure "almost never" exceeds the dangerous 
level, the operation can be extremely costly if not simply impossible. That is, some risk must 
be accepted for the operation to be economically viable. 

The second possibility (constraints "in expectation") means that, given a decision, the ex- 
pected value 0(n) = E(0(u,^)) of a critical quantity (a pressure in our example) should not 
exceed a certain level. Such a formulation is generally mathematically attractive, but it is diffi- 
cult to understand how much risk is involved in choosing such or such prescribed level. Indeed, 
given a decision u, the pressure (to keep on with our example) becomes a random variable 6{u, (,) 
with a certain distribution (which is affected by the chosen decision) , and the only thing one ask 
is that the first moment (the expectation) of this random variable stay below a prescribed level, 
but with no direct control on how much of the probability mass will lie beyond that prescribed 
level. 

The third possibility advocated to (constraints "in probability" or "probabilistic constraints" ) 
means that one accepts that the critical quantity (the pressure, say) remains under the prescribed 
level not "almost always" as earlier, but with a certain probability whose value must be chosen. 
In mathematical terms, one now considers the problem 

min E(j(u,C)) s.t. ¥{e{u,£,) < a) > vr . (3) 

This chosen probability value vr exactly reflects the risk one is ready to assume (in contrast with 
the previous approach of constraints in expectation). As discussed earlier, duality should then 
help in evaluating the sensitivity of the optimal cost function with respect to this accepted, but 
arbitrarily fixed, level of risk. 
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1.2 Quantitative vs. Qualitative Risk Measures 

We now motivate the interest of probability constraints in contrast with other measures of 
risk. Before choosing a risk measure, it is very important to know which type of failure we are 
interested in: quantitative failure or qualitative failure. For example, a power supply company 
would minimize its cost under the constraint of supplying the demand. If that demand cannot 
be fully supplied, it matters to know which percentage of it will not be covered and during which 
amount of time. This is what we mean by "quantitative failure": introducing a penalty for the 
total amount of demand not supplied directly into the cost function, or choosing to constrain a 
quantity which accounts for the amount of supply failure is appropriate in that situation. On 
the contrary, when simply going beyond a certain threshold causes death, it not does matter to 
know by which amount that threshold has been exceeded — this is what we mean by "qualitative 
failure" — but it does matter to know the likelihood of going beyond that critical threshold. 
Probability constraints are particularly adapted to this latter situation. 

In fact, because of the mathematical difficulties raised by probability constraints, these con- 
straints must be exclusively used in the case of qualitative failure problems. For quantitative 
failure problems, there are other risk measures with better mathematical properties (e.g. con- 
vexity), like Conditional Value-at-Risk (CVaR) for instance. Introduced in [13J , CVaR is one 
of the most popular risk measure in finance. CVaR is the average of a random variable for 
the worst scenarios. Denote by au{TT) the quantile function of the distribution of 9{u, •) with 
confidence level vr (also called Value-at-Risk). Then, CVaR, denoted by cpniu), is defined by 

Mu)=H0iu,O \9iu,0>aui7T)) . 

The risk constraint will be then (j)n{u) < a, where a represents the accepted level of risk, and vr 
is fixed a priori. 

Notice that the critical threshold a in the probability constraint is generally provided by 
technical considerations, whereas vr characterizes the level of risk one is ready to accept. That 
is, the decision maker may bargain about the constraint level vr but not on that threshold a 
which is a technical data. With the CVaR approach, this a disappears from the formulation 
and we believe that this is a weakness of this approach. Moreover, in the case of "qualitative 
failure" , there is no meaning in averaging values of 9 beyond a threshold which is supposed to 
be fatal. 

1.3 About this Paper 

Problem ([3]) is the class of problems considered in this paper. Its advantage is again the fact that 
the meaning of constraints in terms of risk assumed is of immediate perception. Its drawback is 
its mathematical difficulty. 

In this paper, we discuss an approach relying upon Lagrangian duality and stochastic gradient 
to solve The use of stochastic gradient is based on the reformulation of constraints in 

probability as constraints in expectation, using an indicator function. As usual with stochastic 
gradient, we assume that the functions involved in the problem (here, j and 9) are known 
explicitly but that the probability law governing the "noise" is not, or that the computation 
of expectations of the variables involved is out of reach or too costly. It is rather assumed that 
an external mechanism delivers samples of ^ which are used in the iterative algorithm. 

Writing the probability as an expectation opens the possibility of using stochastic gradient 
algorithms, but it also raises the difficulty of handling a discontinuous function, namely the 
indicator function. We will discuss various ways of overcoming that difficulty. 

The rest of the paper is organized as follows. In ^ we present the analysis of the problem, 
and our resolution strategy, a stochastic Arrow-Hurwicz algorithm. In ^ we describe two 
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structural difficulties of stochastic programming under probability constraint. To implement 
a stochastic Arrow-Hurwicz algorithm, we need to handle the probability function gradient. 
In ^ the question we are interested in is therefore: how to compute stochastic estimates of 
the probability function gradient? In order to answer this question, we propose two methods 
that allow to obtain biased stochastic gradient estimates, namely Approximation by Convolution 
(AC) ad Finite Differences (FD). We defer to a forthcoming paper to propose techniques based 
on integration by parts ideas and providing unbiased (or consistent) estimates, and to compare 
them with the biased estimates studied hereafter. 

We consider a very basic portfolio optimization problem under a probability constraint and 
use this example throughout the rest of the paper to illustrate and compare the AC and FD 
techniques. Sectional is devoted to the convergence analysis of the proposed methods. Finally, 
fJH reports numerical experiments with the Arrow-Hurwicz algorithm. 

2 Analysis of the Problem 

2.1 Review of Main Difficulties 

Probability constraints provide a straightforward risk formulation with an immediate intuitive 
interpretation. But at the same time, it is well known that such constraints raise important 
mathematical difficulties, such as the lack of convexity or connectedness of the feasible subset. 
Indeed, even if is a convex function of u for almost all values ^, the constraint in ^ may not 
define a convex feasible subset in U (which can even be not connected, if not empty). Those 
convexity or connectedness (or emptiness) properties depend of course on the properties of 9 as 
a function of its two arguments u and ^, on the probability distribution of the random variable 
^, on the level a of constraint required and on the level vr of probability required. One may refer 
to [9] for a discussion on those convexity properties, and to [8] for connectedness properties. 

In [9], the authors prove that if 9{-, •) is jointly convex in {u,^) and the probability measure 
is quasi-concave, then the feasible subset of ^ is convex. But those assumptions seem to us to 
be rather strong in practice, notably the joint convexity property. Indeed, there are numerous 
situations in which the decision variable multiplies the random variable, as in the portfolio 
problem presented in ^ or in a quite other domain, when one wants to model the breakdown 
of an actuator, in which case the random variable must be able to kill the action the decision 
variable. In all those situations, the joint convexity property is not realistic. 

2.2 Mathematical Approach for Programming under Probability Constraint 

Before explaining our resolution strategy, we review some basic results on the stochastic Arrow- 
Hurwciz algorithm [HIS]. First of all, starting with the deterministic constrained optimization 
problem ([T]) and assuming that there exists a saddle point of the Lagrangian over U^'^ x W^, 
the (deterministic) Arrow-Hurwicz algorithm consists in performing successive minimization and 
maximization steps to search for this saddle point: 




(4b) 




where H^yad is the projection onto U and H^ is the projection onto the cone M' 
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2.2.1 Stochastic Arrow-Hurwicz Algorithm 

The stochastic Arrow-Hurwicz algorithm is typically used to solve a stochastic optimization 
problem with constraint in expectation: 

min E(i(n,C)) s.t. E{e{u,C))<a, (5) 

where the calculation of expectations is basically difficult if not impossible. The stochastic 
algorithm overcomes this difficulty by simultaneously approximating the saddle point and the 
expectations by a Monte-Carlo like technique. It is in fact a combination of the idea of the 
Monte-Carlo method with the iterative procedure of gradient methods in optimization. 

We do assume that a saddle point (n", A^) over U'^'^ X exists for the La grangian associated 
with problem 1^ (hence is a solution of ([S]) ) . Observe that this Lagrangian L Q is equal to the 
expectation of £(n, A, •) = j{u, •) + {X,9{u, ■) — a). We use unbiased estimates of the gradients of L 
in u and A obtained with the corresponding gradients of £ evaluated at independent drawings 
of supposed to follow the probability law of ^. More specifically, at stage k of the algorithm, 
ti'^ and A'^' being the current estimates of the solution, 

1. we draw an independent sample (according to the law P of ^), or we observe a new 
independent sample 

2. we compute the stochastic gradients Vujiu^ ,^''~^^) and \/uO{u'',(,'^^^), 

3. we update u^^^ and A'^'^"'^ as follows: 

= ^^ad(n^-e'=(V„i(n^e'•+l) + V„0(tx^e'=+l)A'=)) , (6a) 
A'=+^ = n+(A'= + p'=(0(n^+\e^+i) -q)) . (6b) 

Under essentially measurability and convexity assumptions, assuming that the Lagrangian of 
the problem admits a saddle point, and with 

^£^^ = +00, ^(e'^)^ < +CXD (and the same for p'^) , 

fceN ken 

it is shown in [6] that this algorithm converges in the sense that primal {u'^j^gN and dual {A'^jfcgN 
sequences are bounded a.s. and that {ti'^jfceN a-s. weakly converges to some solution n" of ([5]). 

2.2.2 Mathematical Approach: Strategy and Difficulties 

From now on, we assume that the critical or risky event is defined by a single (scalar) inequality, 
that is, 9 is M-valued. Let %+ denote the indicator function of the positive half-line. The 
principle of our resolution strategy is first to replace the probability constraint by a constraint 
in expectation 

-P{U)<-TT, (7) 

where P{u) = P(^('u, ^) < q) and this probability is evaluated as an expectation: 

P(0(n,O <«)=!£(%+(« -^(n,0)), (8) 

then resort to duality, and lastly resort to the stochastic Arrow-Hurwicz algorithm. Observe 
that w.r.t. the general formulation ([T]), B is now —P and the constraint level a is now — vr. 
There are major difficulties with probability constraints. 
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• First of all, as we recalled in §2.11 convexity is not preserved. Therefore, existence of a 
saddle point of the Lagrangian is not granted; in this case, we should resort to augmented 
Lagrangian techniques to increase the chance that a saddle point does exist. However, 
this raises new problems because the nonlinearities involved in the augmented Lagrangian 
formula cannot be combined straightforwardly with expectation to yield obvious stochastic 
gradient algorithms. This issue of using augmented instead of ordinary Lagrangians goes 
beyond the scope of this paper and is not considered here. 

• We rather address here another difficulty: to replace a probability constraint by a con- 
straint in expectation, we need to handle the indicator function (see dH])); but this indicator 
function involves a discontinuity which may, nevertheless, be smoothed by the expectation 
operation; however, the stochastic Arrow-Hurwicz algorithm is based on the consideration 
of a unique sample drawn at each iteration; obtaining a stochastic gradient is therefore 
not trivial. As it will be shown later on, we propose two ways of overcoming this diffi- 
culty: Approximation by Convolution (AC) and Finite Differences (FD). Both approaches 
will lead us to consider algorithms such as ([6]) in which either a smooth approximation 
9 of function 6 will be used in both equations (j6ap and (|6bp . or an approximation of its 
gradient will be used in ()6ap , leading to a stochastic Arrow-Hurwicz algorithm with biased 
stochastic estimates of the Lagrangian gradients. 

3 Structural Difficulties of Programming under Probability Con- 
straint 

In this section, we focus on two structural difficulties of optimization problem with probability 
constraints. The first one is related to the non convexity of probability constraint: we show 
what are the consequences of this non convexity on the stochastic Arrow-Hurwicz algorithm. 
The second one concerns the behavior of the probability constraint multiplier in some particular 
cases. 

3.1 The Non Convexity of Probability Constraint 

Consider the following optimization problem 

min^(it-l)2 s.t. P(n<^)>7r, (9) 

where ^ is a normal random variable with mean value —2 and standard deviation 0.1. 

In order to point out the first difficulty, we study the qualitative behavior of the underlying 
deterministic problem, namely that in which the probability constraint is expressed with help 
of the cumulative distribution function F of indeed, P(u = ! — F{u) and therefore, the 
constraint in ([9]) can be replaced by 

1 - F{u) > vr (10) 

without of course altering the corresponding Kuhn- Tucker multiplier. 
The Lagrangian of problem @, with constraint written as in (|10p . is 

L(n, A) = i(n - 1)2 + A {F{u) - 1 + vr) ; 

and, the Kuhn- Tucker necessary conditions of optimality allow for the calculation of the solution 
which is, for example with vr = 0.7, 

J = -2.05244 and A" = 0.877913 . 
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As expected, takes the maximal possible value to satisfy the constraint, that is, the (1 — -7r)-th 
percentile of the distribution: F(— 2.05244) = 0.3, so the constraint is active, and saturated. 
Figure [U represents the Lagrangian surface in the {u, A) domain. For A = 0, we recognize 




Figure 1: Lagrangian surface 



the convex shape of the cost function only. For larger values of A, the nonconvex form of F{-) 
shows up more and more, which explains the two valleys. 

We insist on the following two points. First, our approach in this paper is based on stochas- 
tic estimates of the gradients of the Lagrangian, not on their exact computation, which we 
assume impossible. Naturally, we cannot expect the stochastic algorithm to behave better than 
its underlying average driving vector field, which we will study directly. Second, with some 
probability distributions there is a way to manipulate the constraints in order to preserve con- 
vexity. In particular with normal distributions, the map ln(l — F{-)) is concave [12], which leads 
to a convex formulation of the constraint. If we seem to overlook this remark in the following 
treatment, this is because the difficulty we try to point out in this very simple case is a fortiori 
likely to occur in more general situations when the above clever manipulations are no longer 
possible: recall that we do not assume knowledge of the distribution of 9{u,(,). 

Let us now consider the ODE associated with the Arrow-Hurwicz algorithm, 

u = -L'^{u,X) = -J'{u) - XF'{u) , (11a) 
X = L\{u,X) = F{u) -1 + -K . (lib) 

At n" = 1, the unconstrained optimal solution, one has that J'{u^) = and F'{u^) = 1.47 x lO""*^^^ 
because happens to be in the tail of the distribution. Therefore, even for very large values of 
A, L'^{u^-, A) remains very close to 0; in other words, if the (continuous) algorithm (fTT]l is started 
at (or close to) (ti". A), for practically any A, u will stay at u"! At the same time, if doesn't 
satisfy the constraint, one has that F[u^) > 1 — vr. It follows that L'^^{u^,\) > 0: A increases 
almost indefinitely! This is better illustrated by the vector field of the ODE, shown in Figure [2l 

The white zone corresponds to the basin of attraction of the optimal solution. In the grey 
zone, the algorithm is driven more or less indefinitely towards large values of A in a valley 
corresponding to the unconstrained solution u" = 1. 
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Figure 2: Vector field of the ODE 



This example shows that even a deterministic algorithm may, if started on the "wrong" side, 
wander away from the actual optimal solution. Stochastic versions of the algorithm are expected 
to behave erratically, and even if the current values of {u^,X^) are in the basin of attraction 
of the Kuhn- Tucker point, random observations may take the algorithm to other regions away 
from the optimal solution. 

3.2 Degeneracy of the Probability Constraint Multiplier 

Let us now consider the following portfolio optimization problem. This very simple problem 
allows us to point up another structural difficulty of probability constraint. This example will 
also be used in the remainder of this paper to illustrate our various approaches. 

We borrow a capital which we have to pay off at the end of the period with an interest rate I. 
We can invest a proportion u of this capital at the fixed rate b, invest a proportion v at the 
random rate ^, and finally consume the available remainder, which brings a satisfaction measured 
by a concave nondecreasing function /. We assume of course that ]E(^) > I, in other words, 
risk is rewarding. We try to maximize the sum of the satisfaction provided by consumption and 
by the expected final capital. We also want to be in a position to pay off the capital and the 
interests at the end of the period, with a probability of a least p. In this case, the optimization 
problem can be stated as follows: 

maxE(/(l -u-v) + {l + b)u+{l+ ^)v) 

u,v ' 

s.t. u>0, v>0, ii + t;<l, 

P((l + 6)^ + (1 + > 1 + ^ ^ • 



Let 



m = 



1 = 0.15, 6 = 0.2, f{x) = -x'^/2 + 2x , 
if C<^-(^: 



^^3(^y-io(^r+i5(i^) + 



if C < ^ + cj , 
otherwise , 



(12) 



where F is the distribution function. For numerical experiments, we set ^ = 0.4 and a = 3. To 
identify this problem with ([3]) , consider the equivalent minimization problem with cost function 



j{u,v,0 = -fil-u-v)-il + b)u-il+C)v. 



Let also 



P{u, v) = P((l + 6) n + (1 + V > 1 + /) 
This problem is now formulated as 



(13) 



min K(j{u,v,^)) s.t. u + v<l, —P{u,v)<—7r 

u>0,v>0 

with Lagrangian 

L(n, v; Ai, A2) = E(j(u, v, 0) + Ai(n + t; - 1) + A2(vr - P{u, v)) . 
Figure represents the optimal cost as a function of probability level vr. We observe that 

-1.25 

-1.3 
-1.35 

-1.4 
-1.45 



■1.55 



0.1 0.2 0.3 0.4 >D.5 0.6 0.7 



Figure 3: Optimal cost 



this function is not convex. In fact, it is convex for probability levels below 0.57. For probability 
levels close to 0.57, the risk of not being in a position to pay off the capital and the interests is 
important; the investment in the risky asset v decreases to zero, whereas simultaneously, that 
in the secure asset u increases. The optimal cost, which was until then a convex function of the 
required probability level, becomes concave. Above 0.65, v is zero, u is equal to (l-|-Z)/(l-|-6) = 
0.95833 in order to satisfy the probability constraint, and the optimal cost becomes constant. 

This example shows another structural difficulty of optimization under a probability con- 
straint, namely the degeneracy of the probability constraint multiplier. Indeed, for tt small 
enough, the secure asset, u is zero at optimum. For vr large enough, the risky asset, v, is zero at 
optimum. In the latter case, the event |(1 -|- 6) can only have probability 
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Figure 4: Graph of the probabiUty function for {u,v) G [0.9, 1] x [0,0.1] 
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or 1. That is to say, at f = and n = (1 + /)/(! + 6) = 0.95833, the function ()13p exhibits a 
discontinuity (see Figure . 

When vr is large and = 0, the probabihty ^T3\i can only take values or 1 (depending 
on the value of u), that is, this probability is strictly larger than the required level vr when 
the constraint is met. Clearly the constraint is not "saturated", because there is no equality, 
and consequently the corresponding multiplier is zero; small changes in vr will not affect the 
solution. However, the constraint is "active", that is, the solution {u^,v^) of the problem with 
the probability constraint is different from the solution (u*,v*) without it. 

In the remainder, in order to guarantee existence of a saddle point of the Lagrangian, we con- 
sider only probability levels below 0.57 (otherwise, one should resort to augmented Lagrangian 
techniques, but, as it has been said earlier, this issue goes beyond the scope of this paper). For 
example, for a probability level of 0.24, the primal-dual optimal solution is 

J = 0, = 0.50407 , \\ = 0, Xl = 0.08815 . (14) 

4 Stochastic Estimates of Probability Function Gradient 

As it was mentioned at §2.2.21 iii order to use a stochastic Arrow-Hurwicz algorithm, we need to 
handle the probability function gradient, that is, to obtain a stochastic estimate of the gradient 
of (see ([8])) 

P{u) = e(i^+ {a -6{u,C))). (15) 

It is well known that this gradient is difficult to compute; we may refer to [H] for a discussion 
of this topic. Recall that in our case, replacing the probability constraint by a constraint in 
expectation raises the difficulty of handling an indicator function, which is a discontinuous 
function. One way of dealing with this problem is to appeal to a technique based on convolution 
to derive a smooth approximation of this discontinuous function. Alternatively, we can obtain 
a stochastic estimate of the gradient of this function, based on a single sample drawing of ^, by 
appealing to a finite difference technique, and we rely upon the multiplication of such drawings 
along the iterative algorithm to smooth out this crude estimate. 

4.1 Approximation by Convolution Method (AC) 
4.1.1 General Theory 

The basic principle of this approach is to smooth out the indicator function appearing in ()15p so 
that differentiation underneath expectation becomes possible. Consider a smooth function h : 
M ^ M with the following properties : /i as a unique maximum at x = 0, 

/ + 00 
h{x) dx = l. (16) 
-oo 

We will give a few examples of such functions later on and will consider only functions with 
finite support although this is not absolutely necessary. With any other function : M ^ M, 
and r a small positive number, the convolution 

1 r°° /x — y\ 

Mx) = - / Hy)H jdy, 

r \ r J 

can be viewed as an approximation of (j) since h{-/r)/r approximates the Dirac function (in the 
sense of convergence of distributions) at when r tends to zero. The function (pr is differentiable 
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with 



This technique is widely known as the "molhfier" technique [7]. We now apply it to 
recall (fTSll and define 



1 fx — y\ 




+ CXD 



T 

— CO ' 



1e jf"",(ti^l±^,,. 



(here, we have used the fact that h is an even function) 

= E{pr{u,0) (17) 



with 



r Jo \ r 



Then 



{Pr)u{u,0 = ^9u{u,0 / h j dy 

J 



(19) 



r \ r 



and clearly 



P;(n)=E((p,)Un,6)- (20) 



Therefore, for any sample ()19p can be considered as a stochastic estimate of P'{u), albeit a 
biased one; however, this bias vanishes when r approaches 0. In what follows, we evaluate the 
bias and the variance of this estimate as a function of r. 

Remark 1. In the same way, according to (fT7|) . (fTSl) can be considered a biased estimate of 
P{u) whereas 

p(^i,e) = %+(a-^(^/,e)) (21) 

is an unbiased one. In Equation ()6bp of the iterative algorithm, we may either use the unbiased 
estimate or the biased one, consistently with that used in (IGah for G'. The latter option has 
the advantage of preserving the specific geometry of vector fields of Arrow-Hurwicz algorithms 
(with some symmetry, or skew-symmetry, properties, according to the point of view). The former 
option may seem preferable as long as it avoids seemingly unnecessary bias or approximation. 
Both options will be tested later on in ^ Therefore, the next theorem deals with both the 
estimates (1181) and (1191) in order to cover all variants. 



Theorem 2. The random variable (or vector) ^ is supposed to admit a density q{(,). For the 
random variable Xu{-) = 9{u, ■) depending on the parameter u, we assume that the induced 
probability law also admits a density denoted qxui^) '^''^d that this density is at least twice con- 
tinuously differentiable with first and second order derivatives. Then, for any sample drawing 
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^ following the probability density q, the expression (fTHI) provides a biased estimate of P{u) with 
a bias in O(r^) and a variance in 0(1). 

For the pair of random variables (or vectors) (Xu(-), 1^(-)) = {6{u,-),9'^{u,-)^ depending 
on the parameter u, we assume that the induced joint probability law admits a density denoted 
qXuYuix,y) and that this density is at least twice continuously differentiable in x with integrable 

first and second order derivatives. Then, for any sample drawing ^ following the probability 
density q, the expression ()19p provides a biased estimate of P'{u) with a bias in O(r^) and a 
variance in 0(l/r). 

Proof. Consider first (ll7p -( fT8]l . With the induced probabiUty law for the random variable X^, 
one has that 

Pr{u) = - / hi ]qx^{x)dydx. 

f J-oo Jo \ r / 

Using Fubini theorem and the change of variable z = {y — a + x)/r m. the integral in x yields 

;>+oo ;"+oo 

Pr{u) = / h{z) qx^ {rz -y + a) dz dy . 

Jo J-oo 

With the smoothness assumptions on qXn^ the Taylor expansion of this term for r near yields 
Pr{u)= / / h{z)(^qx„{a-y)+rzq'x^{a-y) + ^q'^^ia-y) + Oir^)z^jdzdy 

r+oo ^2 r+oo 

= Qx^ {a -y)dy+-al (a -y)dy + 0(r3) 

by using (jl6p on the one hand and by introducing 

/ + 00 
h{z) dz (22) 
-oo 

on the other hand. The term of order in r can be written as 

'lxSt)dt 



f 

J — c 



and, as such, is recognized to be equal to P(X„ < a), that is, P{u). Therefore, Pr{u) differs 
from P{u) by an O(r^) term (proportional to o"^). 

As for the variance of the estimate p8p . it is equal to the second order moment E^(p,.(ti, 0)^^ 

from which the square of E(|?r(^,C)) ™ust be subtracted. The latter is close to (P(n)^) up to 
a term of order O(r^). Therefore we concentrate on the second order moment which can be 
written, according to (fTSll . 



l2 / 

J — oo 

+00 ^ ^+oo , 2 

h{z)dz] qx„{x)dx 
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using the change of variable z = {y — a + x) /r in the integral in y, 

f +00 



< / ( h{z)dz\ qx^{x)dx 

J —CO J —00 

since h{-) > 0, 



according to (jl6p (last equality) and the fact that is a probability density. 

The proof regarding the bias of P^.{u) w.r.t. P'{u) may follow one of the following two paths: 
either a similar result is proved for the derivative of a function whenever the function itself is 
approximated by another function up to an O(r^) term; or, with (|19|) - (j2Up . we perform similar 
calculations to those we have just performed with (llTh - fllSp . Let us sketch this second path. 
Considering P^ - ipU]) and the pair {Xu{-) ,Yu{-)\ we have that 



Kiu) = I I h(J—--^^yqx^YAx,y)dxdy 



(remember y may be a vector of the same dimension as u and dy should be understood ade- 
quately). From here, we proceed as previously with the change of variable z = (x — a)/r which 
yields 

Pl[u) = - I / h{z)yqx„YArz + a,y)dzdy. 



Then, a Taylor expansion of qXuYu w.r.t. its first argument for r near yields, for the same 
reasons as previously, 

Priu) = -jy qx^Y^ (a, y)dy+'\ll ^''^^gjj"'^^ y dy + 0{r^) . (23) 

Assuming that the first term in the right-hand side above is equal to P'{u) (see Claim [3] here- 
after), we obtain again that Pr{u) differs by an O(r^) term. 

The variance is equal to the second order moment EMp^(n,^))'^ j from which we must 



subtract ^E(p^(ti, ^)) j . The latter is close to (^P{u)) up to O(r^). As for the former, we 
have that 

= ~2 j j ^^(~7~") y'^<lx^Yn{x,y)dxdy 
h^{z)y^qx^Y^{rz + a,y)dzdy. 



Prom here, we proceed as earlier with a Taylor expansion for r close to 0, and it should be clear 
that the above expression is of order 1/r with a coefficient which can be bounded by a term 
proportional to the square of the norm of h. The same consideration is still valid for the 
variance itself. □ 

Claim 3. The first term in the right-hand side of ()23p is equal to P'{u). We sketch the proof 
of this fact here. For any smooth function / : R — >• M, consider 

F(^x)=E(/(0(n,O)) = / f{9{u,0)q{0d^= j f {x) qx^{x) dx . 



14 



Then, 

F'{y) = I f'{e{u,0)e'^{u,C)q{OdC= / I f{x)yqx^Y^{x,y)dxdy. 
Integrating by parts in the integral in x, one gets 



If / is not smooth enough for this calculation to be immediately justified, one can consider a 
sequence of smooth approximations converging to / in order to establish this formula. Let us 
now use it for /(•) = ][K+(a — •). Then F{u) = P{u) and 

P'{u) = - j y j l^+{a-x)^^^{x,y)dxdy 

—-^{x,y)dx dy 

oo '^■^ 

yqxnYuia,y)dy , 

which is exactly the expected result. 
Remark 4. As a side remark, observe that 

QX^ (a) = J qXuYu (a> y) dy , 

and that 

QXr^Yuia, y)lqxS<^) = QYuiy \Xu = a), 

that is, the conditional density of knowing that = a. Therefore, the first term in the 
right-hand side of ([23l) can be written as —¥,{Yu \ = a) x qx^{a). We conclude that 

P'iu) = -qeiu,-){a) X •) | d{u, •)=«)• 

Remark 5. Observe that, although we started with the idea of a smooth function h, the 
expression (I19p of the estimate and the analysis in the proof of Theorem [2] does not involve 
more than the function h itself (not its derivatives), so that we may as well consider non smooth 
functions (and even discontinuous functions at the ends of its support). 

In conclusion, the variance of the stochastic estimate (I19p blows up like A/r as r goes to 
(where A can be bounded from above by something proportional to the square of the norm of 
h), that of p8|) remains of order 0(1), whereas the square of the bias of both estimates goes to 
as B'^r^ (where B is proportional to cr^ — see ()22p ). If the estimate of P'{u) is rather based on 
the average of expressions as (|19p for N independent drawings of ^, the variance will blow up 
as A/{Nr) whereas the square of the bias will still behave as B'^r^. Therefore, the best trade-off 
between variance and bias is realized by that r which minimizes the mean square error (MQE; 
this is the sum of the variance and of the square of the bias) equal to A/ (Nr) + B^r'^: the "best" 
r is thus {A/{4:B'^N)f^^. This yields a MQE estimated to {bA"^/^ B^/^)/{m)^/^ . Therefore, in 

the choice of function h, it is meaningful to pay attention to the quantity cr^ ||^||/2 • Remember 
B is proportional to cr^ and A is proportional to |i/i|||2- 

The bias of the AC estimate goes to with r: this parameter r allows for a trade-off 
between mean and variance which should be adapted to the number of samples available (as 
just discussed) or visited in one run in the context of an iterative algorithm, as discussed later 
on in ^5.31 
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4.1.2 Practical Aspects and Application to Example of §3.21 



Define 

1 if-l<x<l, 
otherwise. 

Table [T] proposes 6 functions with bounded supports that can play the role of function h (see 
()16|) ) and compares them from the point of view of their constants o'^''^||^||^2^ (last column), the 
relevance of which has just been explained. The column h{0) is provided to help identifying the 
functions with their graphs displayed in Figure [H 



h 


h{0) 


-I 


\\h\\h 


4/5|| , i|8/5 
< II^IIl2 


I{x) 


0.5000 


0.3333 


0.5000 


0.3701 


(1 - |xl)/(a;) 


1.000 


0.1667 


0.6667 


0.3531 


vr cos(7rx/2)/(x)/4 


0.7854 


0.1894 


0.6169 


0.3492 


3(1 - x^)I{x)/4: 


0.7500 


0.2000 


0.6000 


0.3491 


15(1 - x^fl{x)/16 


0.9375 


0.1429 


0.7143 


0.3508 


35(1 - 3;^)^/(x)/32 


1.0938 


0.1111 


0.8159 


0.3529 



Table 1: Various h functions 





Figure 5: Several possible h functions 

We observe that the fourth function, namely h{x) = 3(1 — x'^)I{x)/4 is the one to retain 
because it offers the smallest value in the last column of the table. We now apply the technique 
to the example of §3.21 again. The estimates for P^{u,v) and P^{u,v) based on this technique 
and on a given sample ^ read as follows: 

f V( ,\ I + b {1 + a) - {I + b)u - (1 + Qv -^ 

[Pr)uiu,v,0 = H ; (24a) 



l±l.( '^ + °'-"+^''-"+a° ). (24b) 



This MQE will be compared with that obtained by the next approach, namely finite differences. 

We performed some exact computations of bias and variance with the help of Mathematica 
for those estimates evaluated at the optimal solution (see (fUjl ) and with h equal to the fourth 
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function in Table [H We have found: 



E{pr)UuKvK •) = 0.62 - 0.096r^ + 0.012r'' , 
Yai{pry^{J,vK •) = 0.45/r - 0.39 - 0.05r + Oir^) , 

E{pry^{J,v\-) = i.is-o.ser^ + o.oer"^, 

YariprY^iu^v^,-) = 1-62/r - 1.39 - 0.35r + 0(r2) . 

If the estimates are based on the average over N samples, the MQE of the AC estimates are 
obtained by considering Yar{r)/N + (E(r))'^ — (E(0))^. In those expressions, we consider the 
terms in 1 /Nr and r'^ only in order to tune r as a function of A^. This computation is done for 
the sum of the MQE's related to the two components of p'^ (that is, for the mean square norm 
of the vector estimate error — we denote it MQE(r,iV)). This yields r = 1.30 N~^/^. Finally, 
we plug this value of r into MQE(r, N) to get the following function of N (again, calculations 
are exact, up to Mathematica accuracy, even if results are displayed in a truncated form): 

i§-i^ + 0(iV-«/=). (25) 

4.2 Finite Differences (FD) 
4.2.1 General Theory 

The idea here is simply to evaluate the derivative w.r.t. each component uj of the expression 
inside expectation in (fTSj) by the variation of this quantity, caused by, and divided by, the 
symmetric variation (uj + c) — {uj — c) = 2c for a sample ^. We denote Ij the vector of the same 
dimension as u with a 1 in the j-th component and elsewhere. The FD stochastic estimate of 
PL, is 

^ %+ {a-9{u + clj , 0) - %+ {a-eju- cl, , Q) ^^^^ 

2c 

It is wise to use the same sample ^ for the evaluation at n + c and u — c in order to reduce 
variance. A symmetric difference around u is also recommended. We notice that [TT] includes 
FD under a.s. continuity assumptions, which is not our case here, because the indicator functions 
are discontinuous. 

The following theorem provides the analysis of bias and variance of this estimate w.r.t. the 
parameter c. 

Theorem 6. If P (see (|15p ) is three times continuously differentiahle with hounded derivatives, 
the expression (I26p provides a biased estimate of P^. with a bias in O(c^). // 

(HI) 9{-,^) is differentiable with derivatives hounded uniformly in ^; 

(H2) the probability measure of ^ has a density; 

(H3) 9{u,-) is twice differentiable and, for all u, and for every solution ^ of 9{u,^) = a, we 
have that 9'^{u,^) ^ 0; 

then the variance of estimate (|26p is in 0(0""*^). 

// (HI) and (H2) still hold true but (H3) is replaced by 

(H4) 9{u, ■) is three times differentiable and, whenever 9{u, ^) = a for some ^, and 0^(n, ^) = 0, 
we have that 9'^2{u,C) ^ 0; 
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then the variance of estimate (I26p is in 0(c '^/^). 

Finally, under no particular assumptions on g, the best hound for the variance is in 0(c~'^). 



Proof. With the smoothness assumption on P, one has that 
^vU>{u,-)-P'{u] 



P{u + clj)-P{u-clj)-2cPi{u) 



2c 



|F:|(n) + 0(c3) 



which proves the claim on the bias. 

To evaluate the variance of ()26p . we study its second order moment which differs from the 
variance by (P^^. (^i)) up to terms in O(c^) as we have just seen. Consider 

2 



= ■^{^[{6{u + cl,,i)<a}r^{e{u-clj,i)>a}) 

+ F({e(u + clj,0 >a}n{0(n-clj,O < a} 

those two events being of course disjoint. 

Using the mean value theorem (or Taylor representation) for the function 6{-,(,) € C^, 
we have that 6{u + clj,^) = 9{u,^) + c6'^_. [v'^ {^) , ^) , and similarly 6{u — clj,^) = — 
Therefore, 

+ P ({a - (v+iO,^) <e{u,0<a + <, (^^- (0 , }) j , (27) 

Thanks to (HI), we can bound each of the above two probabilities by 

¥{{e{u,^) e {a- Kc,a + Kc]}) , (28) 

where K is the uniform bound on 9[^. . 

Our goal is now to evaluate the behavior of this probability when c is approaching 0. Let m 
be the dimension of ^; g is supposed to be M-valued. Consider any solution ^ of 

6/(u,6 = a. (29) 

The case when no such solution exists for some u will be discussed later on at Remark [3 If 
0'^{u,(^) ^ as assumed in (H3), then the manifold of solutions of ()29p is locally of dimension 
less than or equal to m — 1. The set of ^'s involved in the event in (|28p is locally a set with 
a "backbone" given by this manifold around ^, and a "thickness" which is proved to be of 
order 0(c). Indeed, with a Taylor expansion of 6{u, •) around ^, we get 



e{u,^ + y) = a + (^e'^{u,o,y) + oi\\yr). 

In this expression, we need only consider y's which are (asymptotically as c goes to 0) parallel 
to the gradient 9'^(u,S^) (that is, the component in the kernel of the linear form defined by this 



18 



gradient is useless) . It should now be obvious that to match variations of g around a which are 
of order c, we need only consider y's which are also of order c in norm. If this holds true for 
any ^ in the manifold of solutions of pop . then the probability ()28p is of order 0(c) and the 
second order moment (|27p of our estimate (and consequently the variance too) is bounded by 
an 0(c-^). 

If (H3) does not hold but (H4) does, then the same reasoning can be repeated (for a 
Taylor expansion of the next order) with y's which are now orthogonal to the kernel of the 
Hessian 6'^2iu,£,) (this component is non zero thanks to (H4)) and it should be clear that to 

compensate for variations of g of order c, we now need y's which are of order 0(c^/^) in norm. 
This also gives the order of the probability (I28p and then, the bound on the variance is in 

0(c-3/2). 

We could continue like that by removing assumption (H4) but introducing an assump- 
tion (H5), and so on and so forth. Ultimately, with no particular assumptions, (j28p is of order 
0(1) and the variance is of order 0(c~^). □ 

Remark 7. Suppose that for some u, there exists no solutions to ()29p . Then, since g is assumed 
to be at least continuous in this means that for all ^, 9{u,S^) is always either strictly less or 
strictly greater than a, in which cases P{u) (see (|15p ) assumes either the value 1 or (which 
are extreme values for P). 

If 9{u, •) can be bounded away from a, then the probability (|28p will be for c small 
enough. This is the good case for the variance of the estimate. But 0{u, •) may also approach 
a asymptotically, and, with heavy tails for the density q of ^, it is not possible to give a better 
bound for ()28p than 0(1). Here is an example. Let 0{u,^) = u — {u and ^ are both 
scalar). Consider the probability P(0(O, ^)) € [— c, c] for c small, that is, P(^ > — Inc). Assume 
the density g(^) has a positive support and that it is equal to a(l + ^)~(i+") I]g+(^) with a an 
arbitrary small positive number. Then P(^ > — Inc) = (1 — lnc)~". For c positive and below e, 
(1 — In c)^^ > c, hence this probability is larger than c"". Since a is positive and arbitrarily small, 
we cannot clearly make this case enter the case of better bounds obtained with assumptions 
(H3) or (H4). 



4.2.2 Application to the Example and Comparison with the AC Method 

We have used (|26p (for the two components of the gradient, that in u and that in v) to our 
example and evaluated, once again with the help of Mathematical the mean and variance of 
those estimates at the optimal solution ()14p . The results are as follows: 

¥.v{p{u\v\ ■) = 0.62 - 0.23c^ + 0.06c'' + O(c^) , 
VarV;ip(n'', •) = — 0.39 - 0.12c + 0{c^) , 

EVSp(n'', f ^ •) = 1.18 - 1.49c2 - 42.25c^ - 199.41c'^ + 0(c'^) , 
NaiV{p{u\v\ •) = ^ - 0.39 - 0.74c + 0{c^) . 



Following the same procedure as for the AC estimate, the MQE for the gradient vector estimate 
based on N independent samples is obtained by Var(c)/A^ + (E(c))^ — (E(0))^; in this expression, 
the dominant terms in 1/Nc and in c^ only are retained to tune c as a function of A^. This yields 
c = 0.63A^^^/^ and an optimal MQE equal to: 

l^^l^^OiN-'^'"). (30) 
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Compared with ()25p which was obtained with the AC estimate, this is asymptotically slightly 
better. However, a more careful inspection with complete expressions of the MQEs shows that 
this conclusion becomes true only for N above about 11000. Hence one may say that the AC 
and the FD methods yield approximately the same performances. 



5 Convergence Analysis 
5.1 Stochastic Algorithms 

Consider algorithm With Q{u) = E(^(n,,^)) and J{u) = Ej(u,^), an equilibrium point 

(m", A^) of this algorithm solves the system of Kuhn- Tucker optimality conditions of problem ([T]): 
for all positive e and p, 

J = Hf^ad (u^ - e{VuJ{J) + V«e(n») A»)) , (31a) 
A» = n+(^A« + p(e(n«) -q)) . (31b) 

We will write algorithm ([6]) (with p'' proportional to e^) more compactly: we set x = {u, A) and 
write 

x'=+^ = n(x^'-e^>^'), (32) 

where H stands for the projection operation on U^'^ x and ip^ is driven by an underlying 
process of i.i.d. drawings S,^^^, independent of {x*}j<fc. Let J^'^ be the filtration generated by 
{x^, {C}i<k} so that Tp^ and x'''^^ are J^^+^ measurable. 

With the stochastic estimates produced by the AC and FD techniques considered so far 
in this paper, we obtained biased estimates of V^O (and the bias sometimes also affects the 
estimate of itself), with a bias going to as A; ^ +oo. We will denote "^{x'') the correct value 
of the vector field at x^, namely 

V J(u) + Ve(u)A , (33a) 
a - eiu) , (33b) 



that with which an equilibrium point satisfies (see ()3ip ): 

x^ = U{x^ -e^{x^)) (34) 

for all positive e. 

Define the martingale difference AM^, the bias and the variance of {V''^} by: 

am'' = 4)'' - E(^*-' I J^'=) , (35a) 
5^= = E(V''' I -F^) - ^-(3;^) , (35b) 
V'' =K\\ij'' -Kii;'' . (35c) 

We will use references |10j and [TT] in which convergence results and convergence rates of algo- 
rithm ()32p are provided. Essentially, if the nonlinear projection operation at the r.h.s. of (I32p is 
missing, under conditions on the quantities (j35p in connection with the step size e'^ that we will 
recall below, the trajectory produced by (I32p behaves a.s. as that of the deterministic ODE: 

x = -^{x). (36a) 

In the presence of the projection onto a closed convex set, the differential equation is more 
complex to write since it involves another process z in which z takes values in the orthogonal 
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cone C{x) to the convex set at the current point x (hence this process effectively appears only 
at the border of the convex set). The ODE now reads 

x = -q,(^x)-z, zeC{x). (36b) 

The role of z is to maintain x in the convex set, as it is the case for x^ produced by (|32p . It is 
defined as the "minimum force" which achieves this goal. 



5.2 Convergence 

In this subsection, we recall the conditions which ensure that the stochastic Arrow-Hurwicz 
algorithm will behave as its ODE (|36|) and we refer to the previous subsection to deduce that 
primal iterates will converge, at least locally, towards the solution (assumed unique) of the 
constrained optimization problem. We then apply those results to the case of biased gradient 
estimates provided by AC and FD methods to derive a policy on how to tune the parameters r 
(see (|19|) ) and c (see (|26p ) as functions of the iteration index k in order to satisfy the convergence 
conditions. 

Lemma 8. Consider the iteration (l32]l and assume that 



^ = +00 , (37a) 

k 

^e'^ll^'^ll < oo a.s., (37b) 

k 

Y,i^''fVk<oo. (37c) 



k 



Then, a.s., x has the same asymptotic behavior as the solution of l\36h . 

This result follows from [TOj Chap. 5]. 

Proposition 9. Consider the case when the estimate (I19p (and possibly (llSp too) is (are) used 
in the stochastic algorithm ([6]) (with proportional to e^) with the following choices of the 
stepsize and of the "mollifier" parameter r^ : 

= k-\ r^' = k-f^!^ , (38) 

for (3 and 7 positive. Then the conditions of Lemma\^ are satisfied if 

7<1, /3 + 7>l, 27-/3/2 >1. (39) 

Proof. The first condition (I39p is required by (I37ap . Theorem [2] states that the bias of AC 
estimates is in 0{{r^f) = 0(A:-^), hence e^\\B^\\ = 0{k-^l^+^^); therefore (f37b]l is satisfied 
under the second condition (|39p . As for the variance ^ it is in 

0((^fc)-i) = 0(A;/3/2) which 

yields {e^)^ = 0(^:^^/2-27^ . ^j^^g ([37a is satisfied under the third condition ([39]). □ 

Proposition 10. Consider the case when the estimate (j26p is used in (|6ap (with in ()6bp 
proportional to e^) with the following choices of the stepsize and of the FD parameter c^ : 



e 



for P and 7 positive. Then the conditions of Lemma\E\ are satisfied if, in addition of assumptions 
(HI) and (H2) of Theorem\^ one has that 



7<1, /? + 7>l, { 



27 — /?/2 > 1 if (H3) is satisfied in Theorem\^ 
27 - 3/3/4 > 1 if (H4) is satisfied in Theorem\^ (41) 
^ 27 — /? > 1 otherwise. 
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The proof follows the same pattern as previously using the evaluations of Theorem [U the 
only changes concerning V''. 

5.3 Convergence Rate 

Let (3 and 6 be the integers such that: 

= 0{k-^), = Oik-^) . (42) 

Reference [11] provides a comprehensive analysis of the convergence rates of algorithms of type 
([32|) under the following assumption: close to its unique equilibrium point (supposed to lie in 
the interior of the convex set onto which 11 is the projection), function ^ admits the following 
representation: 

^(x) = ^(x-x») + 0(||x-x*f), (43) 
where A is a matrix with eigenvalues /x satisfying 

Then, a direct application of Theorem 3.1 in [llj gives the asymptotic mean square error (MSE) 
as a function of the algorithm parameters 7, /?, 5 and it states that: 

E(x*^ -x«)2 = 0(fc-''), K = min(2/5,7 + 5). (45) 

Some comments are in order here regarding the application of this result to our situation. 
First, the authors of [H] state than when x" lies on the boundary of the feasible convex set, other 
techniques (e.g. large deviations) are required to establish convergence rates. In our case, we 
expect that the probability constraint is active at the optimum, hence the optimal dual variable 
should be strictly positive. In the example of §3.21 we also have positivity constraints on primal 
variables and that on u (the first primal component) is active at the optimum (see (|14|) ). But it 
is felt that the projection is rather helpful in accelerating convergence for this component (see 
numerical results in the next section). We may consider that, asymptotically, is "frozen" at 
and does not participate to the dynamics of the algorithm ultimately. 

Second, condition (|44p may not be satisfied. We will come back on this point in the next 
subsection. Nevertheless, we used the results of [TT] as guidelines for the choice of parameters /? 
and 7 to drive the primal solution to its equilibrium in the most efficient way. 

That said, in order to achieve the fastest convergence rate, one should seek to maximize k 
in (I45p over the feasible set defined by (I39p or (I4ip and the expression of (5 as a function of /?. 
For the case of AC estimates, 5 = — /3/2, the minimum of 2/3 and 7-/3/2 is obtained when 
those two functions are equal, which yields P = 27/5 and a value of 47/6; because of the first 
condition (j39p . the maximal possible value is obtained with 7 = 1, which yields /? = 2/5 and 
K = 4/5, and we check that this pair (/?, 7) satisfies all conditions in ()39p . Observe that our 
heuristic reasoning at the end of ^4.1.11 and ^4.1.21 in order to tune the parameter r when 
i.i.d. samples are available (here is the iteration index k) yields the same results (see ()25p in 
particular). 

For the case of FD estimates, under assumption (H3) of Theorem [6l the calculations and 
conclusions are the same. Under assumption (H4), 5 = — 3/?/4 and the optimal values are 
f3 = 4/11, 7 = 1, K = 8/11 which is of course worse than the previous case. Finally, in the worst 
case for FD, we get P = 1/3, 7 = 1, k = 2/3. 

The following result is a direct application of [IH Th. 4.1 and 4.2]. This CLT gives additional 
information on the asymptotic behavior of the iterates of (I32p . 
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Theorem 11. Consider algorithm (I32p with assumptions (143 p . ()42p and e'^ = l/Zc ('i/zai is, 
7 = 1 in (13SI) or (00]);. Lei 

wii/i K as in (jiSl). If 213 > 1 + 5, then as k ^ oo, X'' - k^^'^-'^HbB converges in distribution 
towards a normal distribution of mean and covariance S where: 

B = lim kf^B^ , 

fc^oo 

Hb = A- 131, 

H = A-{il + 5)/2)l, 

R = lim k^E{AM''{AM''y \ T^) , 

t.h + h'^t. = r, 

where denotes transposition. 

Remark 12. From the definition of Hi, above and the appearance of A — ((1 + 5)/2)I in the 
definition of S, it is apparent that the strong stability condition (j44p (case 7 = 1) ensures that 
both these matrices are positive definite, so that S is well defined. Indeed, with our choices, 
and H are equal. 

5.4 The Case of Arrow-Hurwicz Algorithms 

We know discuss the properties of matrix A in the situation of Arrow-Hurwicz algorithms. This 
matrix has been introduced in (143 p in general, and the operator ^ is defined by ()33p in our case. 
Thus, A is the linearized version of that ^ at the equilibrium point x", that is. 

However, among the constraints 0, only those saturated (that is, satisfied with equality) at the 
equilibrium point should be taken into account together with their corresponding multipliers 
(that is, the non saturated constraints are virtually absent asymptotically). 

Under the assumptions that the gradients of saturated constraints are linearly independent 
(or, otherwise stated, the operator in the upper right-hand corner of the matrix is injective), 
and that the Hessian of the Lagrangian (that is, the operator in the upper left-hand corner) is 
positive definite, it can easily be proved that the real part of the eigenvalues of A are positive 
(see [2| proof of Proposition 4.4.2]). This is condition ()44|) in the case 7 < 1. When 7 = 1, 
condition (j44p is stronger and will be discussed shortly in the case of our example. Observe that 
if we assume that the only saturated dualized constraint is the probability constraint (which is 
the case in our example), then we should assume that the gradient of this probability function 
at the equilibrium is not zero. 

Going back to example of §3.21 matrix A (restricted to the variables (u, A2)) is equal to 

0.944 1.002 -0.621\ 
1.002 1.211 -1.181 
0.621 1.181 / 

with eigenvalues 0.974 it 0.753 i and 0.207. As predicted, the real parts are positive but the 
smallest one is equal to 0.207 which is not greater than 2/5. Thus, condition (j44p (case 7 = 1) 
is not satisfied (with [3 = {1 + 5)/2 = 2/5). However, in the same way as we ignored multipliers 
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corresponding to non saturated constraints because they are stuck to asymptotically, we may 
consider that the part u of primal variables is "out of the game" ultimately because u is stuck to 
(the constraint n > is saturated) at the end of the transient part of the algorithm (remember 
that the ODE (j36ap is to be replaced by the more complex dynamics ()36bp when following 
boundaries of the admissible domain). Therefore, we consider a reduced matrix A by keeping 
only the 2x2 lower right-hand block (corresponding to the pair {v,\2))- The eigenvalues of this 
reduced matrix are 0.605 it 1.014 i and now condition ()44p is satisfied even for the case 7 = 1. 

6 Numerical Results 

Algorithm ([6]) has been used to solve the example of §3.21 with the AC and FD estimates. 

u^+^ = Iijj^^{u^ - e''[Vu3{u\e^^) - ^{u\e^^) X'')) , (47a) 
A'^+i = n+(A'^ + /(^-P(n^+\e''+i))) . (47b) 

More precisely, for the AC method, S/uP{u,^) should be interpreted as the gradient esti- 
mate ([TQ]) . applied to the example (see (f2l]l ): P{u,^) is either p{u,^) as in (f2T|) or the biased 
estimate given by (llSp (see Remark [TJ. We tested both versions numerically and there was no 
significant difference. The estimate (|18p was retained for the rest of experiments. Of course, 
parameter r'^ is adjusted according to the rule r'^ = ak~^/^ where a is a positive constant to be 
tuned. 

For the FD method, VuP{u, €) is given by (f26l) . applied to the example. Again, parameter 
is adjusted as = hk~^^^ where 6 is a positive constant to be tuned. For P{u,S,), we used ([2T]) . 
Numerical experiments are performed according to the following protocol: 

• all runs of the algorithms start from the same initial conditions: 

u° = 0.2, v° = 0.8, A? = 0.5, A^ = 0.3 . 

Recall that the solution is given by (I14p and all results will be expressed in terms of 
differences with those optimal values (hence the equilibrium point for all variables is at 0). 

• For AC and FD, 100 runs of the algorithms are performed using the same 100 sequences of 
pseudo-random numbers to generate Monte Carlo samples of ^ according to the distribution 
of this variable. 

• 5000 thousands iterations are performed for each run. 

• For AC and FD, averages of the differences — x^ are computed over the 100 runs together 
with their standard deviations. What will be shown on the plots are the trajectories of the 
"average it standard deviation" of those quantities as functions of the iteration index k. 

• The parameters a, b, d, e, /, g appearing in the following rules: 

U CL h b u d h f 



fci/5' A;i/5' e + k' g + V 

are tuned by some trials to try to obtain the "best" results for both methods. 

Figure [6] shows the plots for the four variables and for the AC (continuous line) and DF (dotted 
line) methods. Again what is displayed is the "average it standard deviation" over 100 runs. 
Results obtained on this example are very close (with maybe a slight advantage to FD in the 
earliest iterations) with both methods. This confirm the estimation of variance and bias made 
with Mathematica around the optimum for the estimates obtained with the two methods. 
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Figure 6: Average it standard deviation for AC (solid line) and FD (dotted line) algorithms 



7 Conclusions 

This paper discussed the problem of stochastic optimization under probability constraints and 
in particular methods for solving them numerically. Although there exist other ways of tak- 
ing care of risk considerations in decision problems under uncertainty, we discussed the fact 
( §1.2p that probability constraints are sometimes the most straightforward way of expressing 
and quantifying risk in some circumstances. 

Unfortunately, as shown by the discussion and examples in ^ probability constraints may be 
the source of several pathologies, and the loss of convexity is the most frequent one. Nevertheless, 
one must address the problem of numerical resolution with approaches which may fail in the 
worst cases but which may also succeed to solve nontrivial problems. Our strategy is based on 
duality and stochastic gradient algorithms. Duality, and the use of stochastic Arrow-Hurwicz 
algorithms, require the existence of a saddle point of the Lagrangian, which is not granted for 
the reasons advocated above. The use of augmented Lagrangians would certainly increase the 
chance of existence of saddle points but, in combination with stochastic algorithms, it raises new 
difficulties (namely, the operator of mathematical expectation would appear inside a nonlinear 
function). This new topic will be addressed in a forthcoming paper. 

Apart from this problem of saddle point existence, the search of this saddle point by stochastic 
gradient algorithms is made possible by expressing the probability constraint as an expectation 
involving a discontinuous function. In this paper, we proposed two ways to overcome this 
difficulty, and we studied the convergence and convergence rate of the resulting algorithms. 
The two methods provide biased stochastic estimates of the constraint gradient. Although 
their implementation on a simple example showed a similar behavior, the theoretical results 
reveal that in more general situations, the "mollifier" (or "Approximation by Convolution" — 
AC) method should be of more general use and robustness then the "Finite Difference" (FD) 
method. We defer to a forthcoming paper to propose other estimation techniques providing 
unbiased estimates and based on techniques of integration by parts. 

Still, the surface of this difficult field of numerical resolution of probability constrained 
stochastic optimization problems has been just scratched here, and several directions remain 
open for future investigations. For example, we have considered here only events (whose prob- 
ability is constrained) which are described only by a scalar constraint and the case of events 
described by multidimensional constraints may raise new questions (although the techniques 
discussed in the present paper seem ready for an extension to this case). 
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